PREFACE

I don't "believe" in tests. Tests are like carpet tacks; they're not a fit
subject for belief or disbelief. For some things, tests are useful; for other
things, you need carpet tacks. Rarely, however, do you have to choose between
the two. If students could learn effectively, without wasting time and effort,
and if we could be certain that they had learned those things that we (or they)
wanted them to learn, I would be perfectly happy if they never took a test. The
purpose of measurement is to assist the process of instruction. In order to
understand that purpose, we do not need to look at measurement; we need to look
at instruction. So let's do that.

What I want to examine here is the instructional process as it exists in
classrooms. Most folks call that teaching, but as a former seventh-grade
teacher, I know better. Teaching encompasses what the dictionary defines as an
"instructional process in classrooms." It also involves other things: Parent
conferences. Lunch money. Martin punching out Harold at recess. Gum. The vice
principal who smokes cigars in the teachers' cafeteria. These are all wonderful
things and an integral part of teaching; they are not, however, instruction,
which happens sometime after the pledge to the flag, the loudspeaker
announcements, and sundry other classroom ablutions, but before lunch and
recess jack your kids up into near frenzy. Daniel Lortie, in an excellent and
affordable paperback book called Schoolteacher,* says that when teachers are
asked what a good day at school means to them, they usually reply that a day
when they actually get to do some teaching is a good day. A bad day is a day
that is torn apart with constant interruptions. I agree with this perspective
and am not going to propose testing children three times a day. But I do
believe that testing can be an effective, efficient, and nonthreatening method
of gathering information for making instructional decisions about children. In
the following pages, I will present a perspective on measurement and the
process of instruction as well as some clarifications and suggestions
concerning the field of measurement.

Jeffrey K. Smith

December 1979 Rutgers University 
I. INTRODUCTION



Teaching is an ill-defined art. There are aspects to teaching which, while we
understand them implicitly, are rarely made explicit. An example of one such
aspect is that an agreement exists between the teacher and the learner that
teaching should take place. This agreement is often stated formally, as in an
apprenticeship: The apprentice agrees to work for the craftsman in return for
the training he/she receives. In classroom teaching at the elementary and
secondary levels, the agreement is not formal; it is not even entered into
voluntarily by the student. Often it is the task of the teacher to work at
maintaining this agreement. We usually think of this as "motivation," but it
can just as easily be conceptualized as an agreement between teacher and
learner that the teacher has something worthwhile to teach, and the learner is
willing to make an effort to learn.

Another aspect of teaching not usually considered is that the teacher needs
to know where the student is, on several levels, with respect to the content of
instruction, in order to make decisions on how to proceed. Not only do we
need to know the approximate grade level of a student (in order to find, say,
the right basal reader), but we also need to know if the student is having
trouble with some vocabulary or perhaps with the syntax in a passage. As
trained educators, we can pick up cues on the more microscopic elements needing
evaluation without resorting to formal procedures. Sometimes, however, our need
for information about students requires moving to somewhat more-formal
measures (such as quizzes or worksheets) and sometimes to quite-formal
measures (such as the Illinois Test of Psycholinguistic Abilities).

Some of the latitude that teachers have traditionally possessed, with respect
to the level of formality necessary for obtaining information on students, has
been removed by local, state, and even federal authority. Districtwide testing
programs, statewide testing, and federal mandates such as Title I regulations
and Public Law 94-142 have substantially increased the amount of formal
evaluation taking place in classrooms. Organizations such as the National
Education Association have opposed this shift. Rather than stating my position
on this situation at this point, I would prefer to let my view evolve over the
course of this paper.

Irrespective of one's position, it has become necessary for teachers (and
parents) to become increasingly aware of the process of evaluating students'
learning and to become more sophisticated users of formalized evaluative
techniques. The purposes of this paper are to increase awareness of and promote
sophistication about the role of measurement in the process of instruction.
This paper is intended primarily for classroom teachers. It is also intended for
those administrators, school board members, reading coordinators, and
teacher association members who are involved in the selection of standardized
tests. It may also be useful for parents, but that audience is not of direct
concern here.

Why this paper and why now? There are literally hundreds of
tests-and-measurement publications available (Buros, 1978). This one is based
on several assumptions about the reader that I believe make it especially
useful for the practicing classroom teacher. The assumptions are that:

1. The reader is an intelligent and dedicated professional who sees a need for
the continuing education of all professionals, including teachers.

2. The reader is concerned about the testing and other information gathering
that the reader engages in and that others mandate him/her to do.

3. The reader is inherently more interested in Shakespeare, science
demonstrations, and toothless smiles on small faces than the standard error of
measurement.

4. The reader has a limited amount of time that he/she wishes to devote to this
topic.

The sections that follow include discussions of measurement and classroom
instruction, standardized tests and testing terms, and some considerations for
constructing your own tests. The fifth section contains some final thoughts.
II. MEASUREMENT AND THE PROCESS OF INSTRUCTION




"All the world is a stage" is a particularly useful concept of life for a
playwright. As a measurement specialist, however, I would rephrase Shakespeare
as follows:

All the world is a

A. Series of evaluations.

B. Multiple-choice test.

C. No. 2 lead pencil.

D. Trick question.

E. None of the above.

The nonsense above serves well as a caveat for this section: Anyone looking
for a rationale for the use of measurement in instruction should be suspicious
of measurement specialists. (Measurement specialists liked taking tests as
children.) Therefore, do not accept what follows simply because it comes from a
measurement specialist; if the arguments are not persuasive, use your judgment
as a professional educator. Later, in the discussion of the specifics of
measurement, I will occasionally ask you to take my word for something, but I
am not going to ask that of you now.

The Nature of Instruction

Even in the simplest of settings, instruction is highly complex. Fortunately,
for our purposes here, we do not need to address all the aspects of
instruction. We only need to look at instruction as it relates to measurement.

To begin, have you ever wondered why it is so much easier to explain something
to someone in person than it is to write the explanation? In part, of
course, it is easier because one can demonstrate in person and cannot on
paper. Equally important, in a person-to-person situation, the teacher can see
and hear the student's reactions: a nod of the head, a quizzical look, a correct
answer to a question, or the words "I don't understand." The diversity of the
information the student can communicate easily in this one-on-one setting is
impressive:

1. I don't understand that.

2. Could you say that again?

3. Could you give me an example?

4. I already know this, let's move on.

5. Could you show me some other way?

6. What does that word mean?

7. Can you relate this to something I can understand?

8. Could we go slower (faster)?

9. Let me try it to see if I understand.

10. I need some practice to make sure I can do this.

11. This is the way I learn best.

12. I understand.

These questions and statements are not instruction in and of themselves;
they facilitate the process of instruction. If one looks carefully at the list,
one can see that certain statements occur early in the process of instruction
(or even before instruction begins), some occur typically during instruction,
and some at or near the end of instruction.

From this, we may be able to extract a general principle* concerning
instruction: "Throughout the process of instruction, information about the
learner and his/her learning facilitates instruction." A diagram of this
principle is presented in Figure 1.

Figure 1

Information Concerning Learners and Learning

[Diagram not reproduced. It places the questions and statements below along a
timeline of instruction with four stages: Pre-Instruction, Presentation of
Material, Review of Material, and Conclusion of Instruction.]

How do you learn best?
What do you already know?
Do you understand this presentation?
Should we move a little slower?
See if you can do this.
You need more practice here.
Can you apply this to another content?
It seems we're ready for the next unit.



The conceptualization of instruction presented in Figure 1 is a temporal one;
that is, instruction is abstracted as a process with a beginning, middle, and
end but without any content. The point is that the content of the instructional
process is somewhat independent of the need for information about the
learner. Regardless of what one chooses to teach, information about the
learner and his/her learning is helpful.

*"Contention" may be a better word here than "principle."

Gathering Information Informally 

Up to this point, the discussion has been limited to a situation with one
teacher and one student. In such a setting, most of the desired information can
be obtained through interpersonal communication. But we are rarely fortunate
enough to have the opportunity to teach one student at a time. So, as we begin
to talk about gathering information for purposes of instruction, it would be
useful to discuss instruction in the group, rather than the individual, setting.

In the discussion of the individual setting, all examples of information
gathering were informal. But certain types of information (concerning such
conditions as learning impairment, visual or hearing difficulty, and aphasia,
for example) are rather difficult to assess without standardized procedures.
When working with a group of individuals, the gathering of data solely on an
informal basis poses several problems. Here are four examples.

Some Problems in Gathering Data Informally on Groups of People: The first
problem is inefficiency. It simply is not possible to assess the progress of 5
or 25 pupils using the same method one would use for a single pupil.

The second problem is inaccurate information. If 8 out of 10 heads nod in
response to a comprehension-type question, it may be concluded that the
group is ready to proceed when, in fact, several children may not be ready at
all. Several of the nodders may be trying to please the teacher; one may have
misunderstood the question. These possibilities also exist, of course, in a
one-on-one situation; it's just easier to catch them on an individual basis.

The third problem is incomplete data. Time considerations limit the frequency
of informal information gathering and the number of pupils on whom
information can be obtained. For example, if one were interested in assessing
multiplication facts, it would be difficult to get a complete assessment that
would allow for pinpointing weaknesses if it were done on one pupil at a time
for 25 pupils.

The fourth problem, that of bias, may be the most serious of all. The
problem of bias is slightly different from the problem of inaccuracy or error.
Error is simply being off the mark, sometimes high, other times low, but not
consistently one or the other. Bias occurs when we err consistently in one
direction: for example, when we continually believe a student can do things
that he/she, in fact, cannot do or when we consistently sell a student short.
It can occur within groups of students and against individuals. Racism and
sexism are not the only causes of bias; it can occur against "the low reading
group," the "morning group" (as opposed to the "afternoon group"), or even the
"kids in that corner."
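The difference between error and bias can be made concrete with a small
simulation. The Python sketch below is an illustration only: the function
names, the true score of 70, the 5-point spread, and the 8-point underestimate
are all invented for the example and are not drawn from any study. It shows
that random error averages out over many judgments, while a consistent bias
does not.

```python
import random
import statistics


def judge(true_scores, bias=0.0, noise=5.0, seed=0):
    """Hypothetical teacher estimates: truth + constant bias + random error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [t + bias + rng.gauss(0, noise) for t in true_scores]


def mean_deviation(estimates, true_scores):
    """Average signed gap between estimate and truth (near 0 means unbiased)."""
    return statistics.mean(e - t for e, t in zip(estimates, true_scores))


# Assumed setup: 1,000 informal judgments of pupils whose true score is 70.
true_scores = [70.0] * 1000

unbiased = judge(true_scores, bias=0.0)   # error only: high and low cancel out
biased = judge(true_scores, bias=-8.0)    # consistently sells pupils short

print("error only, average gap:", round(mean_deviation(unbiased, true_scores), 1))
print("with bias, average gap: ", round(mean_deviation(biased, true_scores), 1))
```

In the first case the individual judgments are still off the mark, but the
average gap stays close to zero. In the second, the gap stays near the assumed
8-point underestimate no matter how many judgments are collected.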

It should be remembered that the bias we are talking about is not overt,
blatant, discriminatory behavior. This kind of bias has to do with
misinterpretation of subtle communications, quite subconscious. The bias can
stem from the best of motivations and may be equally harmful to the pupil
whether it takes a negative or a positive direction.

To summarize: (1) there exists a need for information about learners and
their learning in order to facilitate instruction; (2) the nature of the
information can be quite diverse; and (3) informal information gathering
suffers from inefficiency, inaccuracy, incompleteness, and bias.

It would be nice to be able to say that all the problems inherent in the
informal process could be solved by more-formal procedures. Unfortunately, that
is not the case. However, some problems can be ameliorated. We will look at
these more-formal procedures next.

Gathering Information Formally

Formal is an unfortunate word. Formal weddings, formal dinners, and formal
affairs do not sound nearly as interesting as their informal counterparts. In
the context of gathering information for evaluative purposes, "formal" simply
means that the information was gathered in a systematic fashion, following
procedures that have proved to be useful. We don't mean stuffy and rigid; we
do mean not offhand or lackadaisical. In this section, we will be discussing the
gathering of information in a formal fashion, predominantly through the use
of tests.

The perceptive reader will have noted that use of the term "test" has been
assiduously avoided up to this point. The reason for this is that testing is a
loaded concept in American education today. For students, testing is equated
with being ranked and graded; for teachers, testing is equated with
accountability and loss of classroom time; for the public in general, testing
is equated with the stratification of individuals based upon inaccurate and
biased estimates of narrowly defined cognitive abilities. Thus, testing has a
bad name, perhaps deservedly, but bad in any case. Even if this were not the
case, testing should be viewed only from the perspective of its contribution to
instruction. We have, therefore, avoided the term. Now, however, we must begin
to use and define "testing" and some similar terms so that we can be more
precise in the discussion that follows. To begin, there is the concept of
evaluation. "Evaluation" is a very broad term that can be applied to the
activities of an art critic or an auto mechanic. A variety of definitions might
be given for evaluation, but since we are using the term in an educational
context, we'll use an educational definition. "Evaluation is the process of
gathering information for the purpose of making educational decisions." The
word "process" is very important here. It includes the determination of what
needs to be known, how the information will be gathered, what importance will
be assigned to various pieces of information, and how the information will be
used.

The next term we encounter is "measurement." It is far more limited in
scope than evaluation but broader than testing. "Measurement is the process
of making a quantitative abstraction of a characteristic of an individual or
other object (such as a classroom)."

By "quantitative abstraction" we simply mean that a characteristic that exists
in the real world, such as height, home-run hitting ability, or reading
ability, is expressed as a number. The number is an abstraction because it
doesn't carry with it all of the richness of the original characteristic.
Height is a well-defined characteristic, is easy to measure precisely, and does
not lose much when turned into a number. Home-run hitting ability is more of a
problem. (Should it be defined as home runs in a lifetime? The number of home
runs per time at bat? Or the number of home runs per pound of the batsman?)
Using these different definitions, we obtain different people as outstanding
home-run hitters.* Once defined, however, home-run hitting ability is easy to
measure.
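The way a definition determines who comes out on top can be sketched in a few
lines of Python. Everything in this sketch is hypothetical: the three batters
and their numbers are invented for illustration and do not describe real
players. The only point is that each definition of "home-run hitting ability"
turns the same records into a different number, and so into a different
ranking.

```python
# Invented records: (lifetime home runs, times at bat, weight in pounds).
batters = {
    "Batter A": (700, 12000, 220),
    "Batter B": (650, 8000, 215),
    "Batter C": (500, 9500, 150),
}

# Three candidate definitions of the same characteristic.
definitions = {
    "lifetime home runs": lambda hr, ab, lb: hr,
    "home runs per time at bat": lambda hr, ab, lb: hr / ab,
    "home runs per pound": lambda hr, ab, lb: hr / lb,
}


def best_by(definition):
    """Return the batter who scores highest under the named definition."""
    score = definitions[definition]
    return max(batters, key=lambda name: score(*batters[name]))


for name in definitions:
    print(f"{name}: {best_by(name)}")
```

With these made-up figures, each definition names a different batter as the
best. The measurement step itself is trivial; the consequential decision is the
choice of definition.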

Reading ability is especially difficult to measure. First, it is difficult to
define. Second, we don't have a universally accepted metric for it (although
grade equivalent is very popular). Third, it isn't directly observable. To be
sure, we can observe people reading aloud, but that isn't really what we're
interested in. What we are interested in is something that exists inside one's
head (mind, brain). To measure it, we provide tasks that we believe will require
a person to demonstrate the ability we are interested in. It's a tricky
business.

To measure mental abilities, characteristics, or states of being, we often
resort to "tests." Our third definition: "A test is a task or series of tasks
with observable results which are combined and used to estimate an ability,
characteristic, or state of being in a person." Tests are usually
paper-and-pencil activities. Of course, this isn't a necessity; a road test for
a driver's license is a good example of a non-paper-and-pencil test. The
Stanford-Binet is a test; so is a quiz. The typical task that we present to
students is a question, which testing people refer to as an "item," since many
"questions" do not, in fact, have a question mark after them (true-false items,
for example).

We have spent some time differentiating the terms "evaluation," "measurement,"
and "test." In providing these definitions, we have gained an appropriate
perspective from which to view testing. As teachers, we are not essentially
interested in testing at all! We are interested in evaluation.



*Probably Henry Aaron, Babe Ruth, and Ernie Banks, respectively.
Look again, carefully, at the definitions. If evaluation could be accomplished
without measurement or testing, there would be no need for either.
When informal gathering is sufficient, there is no need to resort to
more-formal procedures. Unfortunately, informal procedures are, all too
frequently, not sufficient. Recall the four problems concerning informal
procedures that were discussed earlier:

1. inefficiency

2. inaccuracy

3. incompleteness

4. bias

The question we must ask here is: Can more-formal procedures, such as testing,
alleviate the four problems mentioned above? The answer is: Sometimes.
Let's look at the four issues separately.

Inefficiency: Inefficiency seems to be the problem area testing can help most.
As mentioned previously, it takes no more time to assess multiplication-fact
knowledge for 25 pupils than it does for one (at worst, not much more time). A
paper-and-pencil activity can almost always be conducted efficiently in group
settings. Some may argue that the information gained through testing is not
worthwhile and, therefore, the economy of testing is a false one. We will
address the usefulness of the information later; in terms of efficiency alone,
testing is a good proposition except when one is in the actual process of
presenting material to students. The pace and flow of instruction is critical to
classroom learning; only in unusual cases should this pace be broken by, say, a
classroom quiz. In order to check class progress during instruction,
questioning techniques should be employed. However, at the beginning or end of
a presentation, a short quiz that is not-for-grades (more on this later) is an
excellent and nonthreatening check on student comprehension. Inefficiency is a
problem that is frequently resolved through testing.

Inaccuracy: Inaccuracy is a tougher problem to overcome than inefficiency.
Measurement people usually deal with inaccuracy under the headings of validity
and reliability, which are discussed in the next section. For now, let us
consider inaccuracy from a narrower perspective. If a teacher can construct
several tasks which he/she feels will require the student to use skills or
facts being taught, and if a student responds properly to those tasks, an
inference of competence would seem appropriate. In a one-to-one setting, this
can be accomplished orally. In a group setting, this is quite difficult to do
informally since one cannot go from pupil to pupil requiring the same set of
tasks to be performed.

What is often substituted for a thorough one-to-one assessment is an oral
assessment of the class as a whole. This leads to improper inferences about
individual pupils. Simply because Johnny responded correctly and Mary
nodded while Johnny answered does not mean that Mary understood. The
essential point here is that what can be accomplished informally on a
one-to-one basis can be accomplished formally with a group of pupils if we
simply provide some structure for the process. Asking a pupil several questions
orally and having him/her respond orally is not so different from asking 20
students the same questions and having them write their answers. (Interactive
follow-up can occur in the group setting once the evaluation has been
completed.) When this process is completed, the teacher will know how competent
all students are instead of one. Thus, our information on everyone is based on
data and not inferred from smiling faces and nodding heads.

Incompleteness: Testing is very useful in coping with problems of incomplete
information. In developing a unit quiz, for example, a teacher can first outline
all of the relevant aspects of instruction that he/she wishes to cover (this is
discussed in Section IV). In this way, all students are exposed to all relevant
tasks. Thus, our information is not only more accurate but more complete.

Bias: The issue of bias is complex since bias affects various sub-populations
differently. Here we will compare the bias produced by formal evaluative
techniques with that produced by informal evaluative techniques. In testing, we
often encounter bias in the way we interpret the results of a test. For example,
we infer lack of intelligence when, in fact, a low test score may be due to lack
of familiarity with the culture on which the test is based. Two students with
identical scores may be quite different in the ability being tested.

Informal evaluation allows for the possibility of ameliorating this bias
through a sensitive modification of the data gathering. That is, one can trace
apparent inabilities to their causes and, thus, produce results that are less
biased. On the other hand, this may produce even more bias in the final
analysis because of expectation of ability or inability, which is a problem of
informal techniques. Although bias is still considerable and perplexing in
formal procedures, the potential for bias probably looms even larger in informal
assessment (especially when groups of students are under consideration).

Some Recommendations for Using Measurement 

Having presented several pages of apologia for the use of tests, let me now
present a set of recommendations concerning the testing of students:

1. Don't use tests for grades.

2. Don't have students review for tests.

3. Give students their results immediately.
The key to understanding these statements is this: Learning is what is
important; testing is only there to help.

Let's examine the three points one at a time.

Have you ever wondered why people love riddles and crossword puzzles and
hate tests? I would contend it has nothing to do with the activity itself but
with its consequences. If grading and testing received a divorce on the grounds
of mutual incompatibility, then maybe people would not be so averse to taking
tests. Of course, this raises the issue of how to assign grades, which is a
"whole other topic." If you must use tests for grades, then simply use a
midterm and a final; the use of quizzes or quarterlies is counterproductive.
Grades are not important; learning is important.

This brings us to the second point. Don't have students review for tests;
have them review because of tests. That is, let the test results guide you and
your students in the review of material.

Third, we should provide students with results immediately. How? Simple.
On short-answer, multiple-choice, or true/false tests, it is easy to have
students record two copies of their answers. They hand one in and keep the
second, and the class discusses the test that day or the next. This is
important. Students who have just completed a number of mental tasks need
information about their performance. Problems must be worked out and
misperceptions cleared up before they are set.

A Model of Measurement and Instruction 

These recommendations fit into a general model or framework of instruction 
and measurement that can be presented by posing the following series of ques- 
tions concerning instruction that measurement can answer: 

Where do we begin? and How do we proceed? Before instruction begins, we
need to know where students are and how we can help them to proceed. There
are a variety of ways to gather the necessary information to answer these two
questions. There are existing records concerning prior performance: teacher
recommendations and comments, grades, and previous test scores.* A teacher
may also take an inventory on a student at the beginning of the year, having
the student read aloud, work math problems, and the like, in order to assess
the student's current status. Finally, the teacher can administer standardized
tests to the class as a whole. It is important to make sure that these tests are
thorough enough to provide useful information for instructional purposes.
(Tests with reliabilities of under .90 rarely meet this criterion.) On what
basis should the teacher decide to rely on any one of the three suggested
sources of information? Basically, the teacher should ask, "Do I have enough
information to plan instruction for this child?" If not, an informal or formal
procedure should be employed. How do you choose between these options? Well,
for how many students do you need additional information? How comfortable
are you with your informal techniques? If you have more than a few students,
or if you are not comfortable with informal procedures, then you are probably
better off with a formal procedure of some type.

*Previous test scores can be particularly useful if they are presented to the
teacher in a useful fashion. All too often, test scores are organized by the
teacher's roster of the year before. Standardized test scores should be
reorganized into the current teacher's class. This could be accomplished by
test publishers or scoring services over the summer.

Soon after instruction has begun, the third question will arise: How much
have the students learned so far? A corollary question is: "Are we ready to
move on?" Usually this question can be answered with a quiz (which measurement
people call a "formative test"). One aspect of classroom instruction that
is almost as universal as it is counterproductive is having students study for a
quiz. Let the quiz results guide their study, not motivate it.

At the end of the academic grading period, semester, or year, we ask, "How
did we do overall?" Frequently, there is a concomitant need for assigning
grades. At this point, a final exam or test (what measurement people call a
"summative test") might be appropriate. I, personally, don't like to see grades
tied to test scores, but I must admit I don't have much in the way of
alternatives, not without engaging in philosophical discussions about the
nature and purpose of American education. At the end of the year, it is often
useful, for systemwide evaluation purposes, to administer a standardized test
of some type. What is often ignored in school systems is the value of these
scores in planning for the next academic year. (See the earlier footnote on
previous test scores.)

To summarize the perspective on measurement and instruction presented
here, the following statements are appropriate:

1. As educators, we are interested in instruction and learning, not in testing
per se.

2. Evaluation is an essential aspect of the instructional process.

3. Tests, as a form of measurement, can be very useful in evaluation. They
have properties that often make them more useful than informal procedures.

4. We should use tests for a clearly identified purpose, and when we do use
tests, we should be certain that testing is the best way to obtain the
information.

5. Many of the problems of test anxiety are not attributable to tests but to
the consequences of tests, which can be modified.

Having concluded this discussion of the role of measurement in instruction, we
leave the arena of persuasion and move into those areas more directly
concerning technical expertise.

III. UNDERSTANDING STANDARDIZED TESTS

In this section, I will provide you with some assistance in selecting and
constructing tests. We will look at how to select a standardized test, interpret
its scores, and read through measurement jargon. Then, in the next section, we
will run through some practical steps for test construction.

Selecting a Standardized Test

There are literally thousands of standardized tests available to educators: some
of them are quite good; many of them are very poor. The question that occurs
to the practitioner is, "Should I use a standardized test, and if so, which one
should I use?" Actually, there is a question that should arise prior to that one:
"What do I want to measure and why?"

There are two reasons for asking this question. The first is that it will help
you decide whether you want a standardized test or a locally constructed one
(or an informal procedure). Second, it will help you decide which standardized
test you want, if that should be your choice. But first we need to address the
issue of standardized vs. locally constructed tests.

The classroom teacher may say at this point, "All of these decisions are out
of my hands." That may be true at the individual classroom level, although
there are hundreds of tests designed specifically for classroom use, and many
administrators would be willing to purchase such tests if a solid argument for
their use could be made. Also, most school districts include teacher
representatives in test-purchasing decisions. It may well be the case,
however, that the next section is more appropriate for administrators.


Advantages and Disadvantages of Standardized Tests

Although standardized tests have come under a considerable amount of
well-deserved fire recently, they, in fact, have many advantages when compared
to locally constructed tests. Consider the following:

1. They are already written and require no teacher or staff time for
development.

2. Most of them were written by professionals and have some evidence of
quality.

3. Most of them yield scores that you can compare with those of another
group.

Of course, standardized tests have some drawbacks:

1. They cost money.

2. Many of them are quite poorly developed and have little evidence of
quality.

3. Most of them are not exactly what you had in mind.

4. Many of them provide inappropriate norm groups.

So how should one make a decision? Get help. This is a fairly complex
activity, and a few dollars spent for a consultant (not a publisher's
representative) can make a substantial difference in the utility of your
testing activity. Here are a few questions you will have to answer:

1. How important is it for you to compare your group with a norm group?
Norm data help with Title I evaluation or with convincing board members
or the public at large of school progress. You can't do that with your own
test.

2. How important is it that the test correspond with your curriculum? Very
few standardized tests will match your curriculum as closely as you would
like.

3. How communicable do the results need to be? This concern is similar to
the first. If several individuals or groups need to use the results, a test
that is widely known has some advantages.

4. How technical or complex is the trait you are measuring? It is difficult
for a local school district to develop, say, an early-screening device for
learning problems. There are, however, some notable exceptions to this.
(See Naron, 1977.)

Having presented these guidelines, let me reiterate: Get help. Call your local
college or university and ask for some assistance from the faculty. Some states
have agencies that are designed to help school districts. You might also try
your state's department of education.

Before you finally decide on a standardized instrument, be sure to look up
the test in Buros' Mental Measurements Yearbooks. These reference books
have rigorous and very useful reviews of almost every test under the sun. A
note on these books: If a test was reviewed in, say, the seventh edition of the
series, it probably won't be included in the eighth edition. Most tests that are
worth buying are reviewed in Buros. Keep looking.*

*The Mental Measurements Yearbooks have been taken over by the University of
Nebraska and will continue to be published.

What All Those Words Mean

As you plunge into the world of standardized testing, you are going to be
inundated with jargon. Though special terms are often useful, some people in
the field are deliberately obscure for the purpose of glossing over
weaknesses. This is definitely caveat-emptor time. Below is a nonalphabetical
guide to what all the terms mean.

A. Terms related to different kinds and uses of tests 

Standardized Tests: "Standardized" means the same test has been given under
the same conditions to large numbers of people. Just about any test that has
ever been given to anybody before someone tried to sell it to you is likely to
be called "standardized." That is, the term is used rather loosely. What it
should refer to is a test that provides some standards or expectations of
performance. What it usually means is that there was a norming group that was
given the test.* Be wary. Some of the members of the norming groups for tests
you might use for first graders fought in the Great War. Make sure the norms
are current and appropriate for the kind of person you wish to test.

*A norming group is supposed to be a representative sample of people who are
similar to the people for whom the test is intended. A fourth-grade reading
test would have as a norming group a sample of fourth graders drawn from a
variety of backgrounds.

Norm-Referenced Tests: The meaning here is pretty similar to that of
standardized tests. Basically, a norm-referenced test is one that yields
scores that are interpreted by comparing them with scores earned by other kids
on the same test. Telling you that Johnny got a 73 on a test in reading
doesn't tell you much. Telling you that Johnny is in the top 10 percent of all
fourth-grade students is more informative, if you have a sense of how well
students at that level can read. "Norm-referenced" simply means that scores
can be interpreted by comparing them with scores of other people.

Criterion-Referenced Tests: In 1963, Robert Glaser (Glaser, 1963) introduced
the concept of criterion-referenced tests. His basic idea was that sometimes
you just want to know whether a kid can do something (like learning
multiplication tables) or not, and you don't care how well anybody else does
on this. This was a nice bit of insight on Glaser's part. There is, however,
one large problem with criterion-referenced tests: If you don't really care
how well people as a group do on a test, it is very hard to assess the quality
of the test. The reasons for this are not necessarily conceptual. It's just
that a criterion-referenced test requires measurement people to address the
issue of reliability and validity in a new light, and we're still a little
blinded by that light. Beware of the person who tells you, "We don't need
validity or reliability on this test; it's criterion-referenced."

The basic difference between criterion-referenced and norm-referenced tests
has to do with how we use the scores. In norm referencing, we try to
understand scores by comparing people according to behaviors or performance.
"Ben is the best bowler in the league" is a norm-referenced statement. For
criterion referencing, the score has a meaning related to the test itself, not
to how others did on it. "Ben's average is 185" is a criterion-referenced
statement. The difference is not so much in the tests as in the referencing
system. There are measurement people who contend that the difference is in the
tests. Although they may have a point, I think the crux of the issue is in the
reference system.

A final note: Some people think that criterion-referenced tests have to be
short and/or related to the classroom and/or have a cut-off score. None of
these conditions are necessary, although they may be true of some
criterion-referenced tests.

Mastery Tests: Basically, a mastery test is a criterion-referenced test that
has a cut score attached to it. That is, if you are above a predetermined
level you are considered to have mastered whatever the test was about. The
written test for a driver's license is an excellent example of this. Usually,
however, mastery tests are used in instruction, which makes the driver's test
a little less illustrative. Of course, people who fail that test usually
continue to study, and people who pass it burn their copy of Rules of the
Road, so it is instructional in that sense. If a teacher gave a quiz on
multiplication tables, and if anyone getting more than 90 percent right didn't
have to study the tables anymore, then the test would be a mastery test. More
about cut scores later.
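
The three referencing systems described so far can be applied to the same set
of scores. The following is only a minimal sketch in Python: the student
names, the scores, and the 50-item quiz length are hypothetical, and only the
"more than 90 percent right" cut score comes from the multiplication-table
example above.

```python
# Hypothetical quiz results: number of items right out of 50.
TOTAL_ITEMS = 50
scores = {"Ann": 48, "Ben": 46, "Cal": 39, "Dee": 31}


def norm_referenced(name):
    """Rank a student against the group (1 = highest score)."""
    ordered = sorted(scores.values(), reverse=True)
    return ordered.index(scores[name]) + 1


def criterion_referenced(name):
    """Percent of items right, with no reference to other students."""
    return 100 * scores[name] / TOTAL_ITEMS


def mastered(name, cut_percent=90):
    """Mastery decision: strictly more than the cut score."""
    return criterion_referenced(name) > cut_percent


for student in scores:
    print(student, norm_referenced(student),
          criterion_referenced(student), mastered(student))
```

The same raw score feeds all three functions; what changes is only the
reference system used to interpret it.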

Content-Referenced and Domain-Referenced Tests: Any idea that yields a
glimmer of success in education will be dutifully extended, expanded, and
elaborated upon until every photon has been accounted for. This is true with
criterion-referenced tests. Content-referenced and domain-referenced tests are
marginally different perspectives on the idea of criterion-referenced tests.
Think of them as one concept, and you will always be within a degree or two of
perfect accuracy.

Objective Tests/Subjective Tests: This sense of objective means fair and
impartial; it is to be contrasted with subjective. No test is objective; it
only aspires to be. Tests are considered to be more objective if:

1. An examinee gets the same score from two different graders.

2. The conditions for tests are the same for everyone.

3. The items mean the same thing to all people.

The third criterion is the one we can never really satisfy. People have
different lives and cognitive processes, and the same stimuli can easily mean
different things to different people. Do not despair over this situation; even
though it means our tests can never be completely objective, it's what keeps
us from all being Calvin Coolidge (that is, the fact that we are all
different). Hence, the price is not too great.

Formative Tests/Summative Tests: The difference between formative and
summative tests really lies in how the tests are used. Formative is used, as a
term, in the same sense as "formative years"; there is a developmental or
instructional aspect to it. A formative test is one in which the results are
used to make decisions about the future instruction of the student. A
summative test is used to make a summary statement about a student (such as
giving him/her a grade). As you can see, it is difficult to ascertain whether
a test is formative or summative until you know what is to be done with it.
Typically, though, such tests as final exams, certifying exams, and quizzes
used only for grading tend to be more summative, while diagnostic measures and
quizzes used for instruction are considered formative.

Diagnostic Tests: Diagnostic tests are a special type of formative test. They
are specifically designed to address a question related to a specific aspect
of instruction, such as "Does this student have visual difficulties?" or "Does
this student need work on letter-sound relations?" The field of diagnostic
testing is quite large and really deserves more attention than can be given
here (see Rapaport, Gill, and Schafer, 1968).

Cognitive/Affective/Psychomotor: These three terms are convenient ways to
classify tests according to the types of things they measure. Cognitive tests
measure people's mental abilities (we'll quibble over aptitude and achievement
later). Psychomotor measures involve directed physical action on the part of
the subject in response to the stimulus. Affective measures tap the subject's
attitudes, opinions, and state of mind.

Achievement/Aptitude Tests: Achievement tests are designed to measure
proficiency in subjects a person has been taught. Aptitude tests are designed
to allow predictions of future achievement. Now, one good way to predict
future achievement is to look at present achievement; therefore, many
achievement tests serve well as aptitude tests. Measurement specialists love
to argue over whether there really is such a thing as aptitude. The issue is
not in imminent danger of resolution.

A brief note on the terms related to kinds of measures: A test can be
objective/achievement/formative/criterion-referenced/cognitive/mastery all
at the same time! A spelling quiz with a score that determined whether a
student had to do more spelling work would fit all of those categories. It is
useful to think of a particular test (and usage) and run it through the list
of terms to see where it fits.

B. Terms related to various kinds of scores

When you receive a student's results on a standardized test, you are likely to
run into any number of bizarre-looking scores. A little later, we'll talk about
how to interpret them. Here we'll just define them.

Raw Score: Basically, a raw score is the number of items a student gets right.
Some testing organizations use what is called "formula scoring," which
subtracts a fraction from the number right for each answer that is guessed
wrong. Rarely does any scoring system penalize guessing to the point where one
shouldn't guess. Usually it is an attempt to neutralize guessing.*
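
A short sketch may make formula scoring less mysterious. The function names
and the numbers below are hypothetical. The penalty of 1/(number of options)
follows the footnote to this passage; the second penalty, 1/(options - 1), is
another commonly described correction and is included here only as an assumed
point of comparison. Check the publisher's manual for the rule a given test
actually uses.

```python
from fractions import Fraction


def formula_score(right, wrong, penalty):
    """Number right minus a fractional penalty for each wrong answer.

    Omitted items are neither rewarded nor penalized.
    """
    return right - wrong * penalty


def expected_gain_per_guess(options, penalty):
    """Average change in score from one blind guess on one item."""
    p_right = Fraction(1, options)
    return p_right - (1 - p_right) * penalty


options = 5  # hypothetical five-option multiple-choice items
print(formula_score(60, 20, Fraction(1, options)))                 # 56
print(expected_gain_per_guess(options, Fraction(1, options)))      # 1/25
print(expected_gain_per_guess(options, Fraction(1, options - 1)))  # 0
```

Under either penalty the expected gain from a blind guess is never negative,
which is consistent with the point that scoring systems rarely penalize
guessing to the point where one shouldn't guess.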

Percent Correct: This is the number of correct answers divided by the total
number of items. This is not at all similar to a centile or percentile.

Percentile Rank/Centile Rank: These words mean the same thing. A percentile
rank tells you what percent of the people in the norm group fell below this
student's score. (For example, a percentile rank of 84 means that 84 percent
of the norming group fell below the score.) If the student in question isn't
really a member of this group (such as a seventh grader being compared with
fourth graders), the percentile is somewhat less meaningful.†
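
Because percent correct and percentile rank are so often confused, a small
sketch may help. The norm-group scores and the 40-item test length are
hypothetical, invented only to show that the two numbers answer different
questions about the same raw score.

```python
def percent_correct(items_right, total_items):
    """Share of the test's items answered correctly."""
    return 100 * items_right / total_items


def percentile_rank(raw_score, norm_group_scores):
    """Percent of the norm group scoring below this raw score."""
    below = sum(1 for s in norm_group_scores if s < raw_score)
    return 100 * below / len(norm_group_scores)


# Hypothetical norm group of ten students on a 40-item test.
norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]

print(percent_correct(30, 40))          # 75.0
print(percentile_rank(30, norm_group))  # 70.0
```

Percent correct depends only on the student and the test; percentile rank
changes whenever the norm group changes.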

Stanine: Stanine is an abbreviation of "standard nine," and is a score
reported on a scale that is divided into nine segments. Each such score is
expressed as a number from one to nine. A stanine of 1 means that a student's
score was in the bottom four percent of the norming group; a 2 means the score
was between the fifth percentile and the eleventh; a 3 means between twelfth
and twenty-third; a 4 means between twenty-fourth and fortieth; a 5 means
between forty-first and fifty-ninth; a 6 means between sixtieth and
seventy-sixth; a 7 means between seventy-seventh and eighty-eighth; an 8 means
between eighty-ninth and ninety-sixth; and a 9 means the top four percent.
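
The stanine bands above amount to a lookup table. As a rough sketch, the
Python function below (its name is hypothetical) converts a percentile rank to
a stanine using exactly the boundaries listed in this paragraph; a particular
publisher's table may draw the lines a point or so differently.

```python
from bisect import bisect_left

# Highest percentile rank falling in stanines 1 through 8, as listed above.
UPPER_BOUNDS = [4, 11, 23, 40, 59, 76, 88, 96]


def stanine(percentile_rank):
    """Return the stanine (1-9) for a whole-number percentile rank."""
    return bisect_left(UPPER_BOUNDS, percentile_rank) + 1


for pr in (3, 11, 50, 60, 97):
    print(pr, stanine(pr))
```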

*This is done by subtracting 1/number of options for each wrong answer from
total correct answers.

†It does tell you how the student did compared to fourth graders, if that is
of interest to you.

Standard Score: A standard score is a score that has been converted from an
original raw score, usually by transforming the mean and standard deviation
of the raw score. SAT scores and IQ scores are good examples of standard
scores. Because it is easier to work with scores that have been converted,
most test publishers derive their own standard scores. The problem with
standard scores, however, is that they are meaningless unless the test
publisher provides information on how to interpret them.
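
The conversion behind a standard score can be sketched in a few lines. This
example assumes the usual linear transformation: express the raw score as a
distance from the raw mean in standard-deviation units, then re-express it on
a new scale. The raw mean of 40, raw standard deviation of 8, and the new
scale with mean 100 and standard deviation 15 are all hypothetical values
chosen for illustration, not figures from any particular test.

```python
def standard_score(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Linearly rescale a raw score to a new mean and standard deviation."""
    z = (raw - raw_mean) / raw_sd  # distance from the mean in SD units
    return new_mean + new_sd * z


# A raw score of 52 is 1.5 standard deviations above the raw mean of 40.
print(standard_score(52, raw_mean=40, raw_sd=8, new_mean=100, new_sd=15))  # 122.5
```

This is why the publisher's documentation matters: a 122.5 means nothing
until you know the mean and standard deviation of the scale it was placed on.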

Grade-Equivalent Score: Although this type of score is used in many
elementary schools, it has a number of shortcomings. A grade equivalent of 4.5
indicates the average performance of a fourth-grade pupil in the fifth month
of school. The main problem with grade equivalents is that there is little
evidence to suggest that kids march along one month at a time in their
academic development. Furthermore, students' average growth, in general, from
third grade to fourth grade may be much greater than from fourth to fifth, but
we treat it as if it were the same.

Normal Curve Equivalent (NCE): NCEs are standard scores with a mean of
50 and a standard deviation of 21.06. They were developed to avoid technical
difficulties associated with grade equivalents and percentile ranks (you can't
properly take averages of grade equivalents or percentile ranks). NCEs are
very similar to percentile ranks, although they tend to be more moderate at
extremely high and low levels of performance.
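
To see how NCEs relate to percentile ranks, the sketch below converts a few
percentile ranks to NCEs. It assumes the usual definition, which the text does
not spell out: find the normal-curve z-score for the percentile rank, then
rescale it to a mean of 50 and a standard deviation of 21.06. The function
name is hypothetical.

```python
from statistics import NormalDist

NCE_MEAN = 50
NCE_SD = 21.06


def nce_from_percentile(percentile_rank):
    """Convert a percentile rank (1-99) to a Normal Curve Equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z


for pr in (1, 10, 50, 90, 99):
    print(pr, round(nce_from_percentile(pr), 1))
```

The two scales agree closely at 1, 50, and 99, while a percentile rank of 90
corresponds to an NCE of about 77, which illustrates the "more moderate"
behavior toward the high and low ends.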



C. Terms related to test quality 

"Valid and reliable" is a phrase that appears in almost any discussion of
testing. Since these terms frequently involve numbers, it is common for people
to ignore the evidence that a test publisher presents on these issues and just
look for a concluding phrase something like "... therefore the validity and
reliability of this measure is well-established." Since that type of activity
makes me shudder, I'm going to present a little longer discussion than usual in
this section and explain what these terms really mean.

Validity: To begin, validity is all we really care about. If a test is valid,
it has to be reliable.* The reason measurement specialists talk about
reliability so much is that it is easier to calculate. Simply speaking, a test
is valid if it measures what you want it to measure. As is true with so many
of our other testing concepts, the validity of a test depends upon how it is
used. Technically, it is more proper to talk about the validity of a
particular application of the test than of the validity of the test itself.
This borders on pedantry, so we'll go along with convention and talk about
"the validity of a test."

*Like the relationship of "antique" to "old."

Validity is often confused with the evidence of validity. A test could have no
validity evidence at all and still be the most valid test ever constructed.
It's similar to guilt in a criminal proceeding. The guilt or innocence of a
person exists regardless of the evidence the DA can muster. As the evidence
builds up, the jury becomes more and more certain of the guilt of the suspect.
However, a person could be guilty of something and there could be little or no
evidence of it. When validity evidence is presented in a test's technical
manual, it should be viewed as an argument for the validity of the test. The
evidence has to be weighed and a judgment rendered.

There is one important difference between guilt and validity. Guilt is often
viewed as a dichotomous (either-or) situation, whereas validity exists on more
of a continuum. That is, it is reasonable to talk about test X being more valid
than test Y for a particular use. Let's look at some of the types of evidence
that people present for validity. The most common type of validity evidence
consists of correlating the test in question with another, well-established
test that purports to measure the same thing. Sometimes, instead of using
another test, people use grades or teacher ratings or some other index.
Whenever we compare a test with another measure in this fashion, it is called
concurrent or, sometimes, criterion-related validity.

A second type of validity has to do with the nature of the questions on the
test. In essence, we are asking, "Are these items a reasonable subset of the
total pool of items that might be used to measure this trait?" If the answer is
"yes," we are establishing what is called content validity.

A third type of validity addresses the question, "Does this test represent a
reasonable way to conceptualize the trait we are trying to measure?" This is a
little trickier to understand than the other two. For example, we could look at
the Stanford-Binet IQ Scales and ask, "Is this what we mean by intelligence?"
This type of validity is called construct validity. It is determined through
the accumulation of research and development of theory in an area and is
difficult to determine in a single study.

In presenting validity evidence, test authors are likely to present a lot of
statistics that are difficult for the average educator to comprehend. Should
you find this to be the case, I have two suggestions:

1. Get help.

2. Read Buros.

Reliability: Reliability is easier to talk about than validity. Reliability is
a way to assess the accuracy of a measure. The simplest way to explain
reliability is to imagine giving a test twice, about 10 days apart. The
reliability coefficient would be an index of how similar the results are in
the two administrations (it is actually the correlation between the two sets
of scores). If you are using a test to make decisions about individual
students (instead of, say, for program evaluation), the reliability
coefficient should be above .90, and it would be much better if it were above
.94. Many tests that are sold for individual testing do not even approach
these standards. Why should the reliability be so high? Because if it is
lower, there is far too much chance for error in a student's score. I will
explain more about this under "standard error of measurement." A few final
words on reliability: There are different ways to measure reliability (some of
which only involve a single administration of the test). Some of the more
modern approaches to test accuracy do not use reliability coefficients for a
test as a whole but do provide standard errors of measurement for every
possible score (this is the case with the increasingly popular Rasch Model).
This is fine; in fact, it's superior to a single reliability index. Remember,
all tests should provide some index of score accuracy.
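
The test-retest idea described above can be sketched directly: give the same
test twice and correlate the two sets of scores. The six pairs of scores below
are hypothetical, and the function is an ordinary Pearson correlation written
out by hand so that nothing beyond the Python standard library is needed.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    var_x = sum((x - mean_x) ** 2 for x in first)
    var_y = sum((y - mean_y) ** 2 for y in second)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six students on two administrations of one test.
first_try = [31, 25, 40, 36, 28, 44]
second_try = [33, 24, 41, 34, 30, 45]

r = pearson_r(first_try, second_try)
verdict = "above .90" if r > 0.90 else "not above .90"
print(round(r, 2), verdict)
```

A real reliability study would use far more than six students; the point here
is only what the coefficient is a correlation between.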

Standard error of measurement: Although most test publishers and measurement
specialists focus on the idea of reliability, it is really the standard error
of measurement (SEM) that is of concern to test users. The SEM tells us just
how far away from the truth the student's score might be. Error here does not
mean mistake; it means uncertainty. If we could give a test to a student a
thousand times and take his/her average score, we would have a good guess at
his/her "true" score on this test. But since we usually give a test only once,
we need an index of how far from typical performance this particular score
might be for this student. If we take the SEM, double it, and add it to the
student's score, it will tell us how far off on the low side our observed
score might be. If we then subtract twice the SEM from the student's score,
this will tell us how far off on the high side we might be. We can be about 95
percent sure that a student's true score will be in this range.*

We mentioned before that one should look for reliabilities of .90 and above.
Actually, a better procedure is to find the SEM and then multiply it by four.
This will tell you the range of possible scores that you will encounter (from
2 SEMs below to 2 SEMs above the observed score) for each student. For
example, some standardized reading tests have SEMs of .7 grade-equivalent
years. Multiplying by 4, we get a range of almost 3 grade-equivalent years.
Using this SEM, an estimated grade equivalent of 3.6 might be as high as 5.0
or as low as 2.2. Clearly, this is unacceptable for making instructional
decisions about students.

Some tests give one SEM for the test as a whole; others give a different SEM
for each possible score, smaller toward the middle scores and larger toward the
extreme scores. This latter procedure is generally preferable, since it is a
better reflection of the reality of the situation. It is critical for you, the
test user, to be sure that the range given by four times the SEM yields a
measure that is accurate enough for you.
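
The arithmetic of the SEM band is simple enough to sketch. The first function
below reproduces the reading-test example (a grade equivalent of 3.6 with an
SEM of .7). The second function is an addition not found in the text: it uses
the classical relation SEM = SD x sqrt(1 - reliability) to show why a
reliability of .90 or better matters, and its standard deviation of 15 and
reliability of .91 are hypothetical. Both function names are illustrative.

```python
from math import sqrt


def score_band(observed, sem, multiplier=2):
    """Range within about 2 SEMs of the observed score (roughly 95 percent)."""
    return observed - multiplier * sem, observed + multiplier * sem


def sem_from_reliability(sd, reliability):
    """Classical estimate of the SEM from a test's SD and reliability.

    Assumption: this relation comes from classical test theory, not from
    the discussion above.
    """
    return sd * sqrt(1 - reliability)


low, high = score_band(3.6, 0.7)
print(round(low, 1), round(high, 1))  # 2.2 5.0

# Hypothetical test with SD = 15 and reliability = .91.
print(round(sem_from_reliability(15, 0.91), 1))  # 4.5
```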



*This isn't a very precise description of the procedure, but it's close enough
for most purposes.

What Do I Do With the Results?

In some respects, if you are asking this question you probably shouldn't be
testing in the first place. That is, when you are testing you should know in
advance what information you are looking for and how (in what form) you are
going to receive it. In order to accomplish this, first read the teacher's
manual if one comes with the test. It was probably written with someone like
you in mind. If the manual isn't clear, call or write the test publisher or
author. Don't be afraid to question what is in the manual. There are no magic
scores in the field of measurement. Although some test scores need to be
interpreted in combination with others, there are few, if any, scores that
cannot be interpreted by a reasonably intelligent, experienced educator.

The computer printout you receive may be difficult to decipher, so make
sure you read any accompanying material carefully. If your printout is
impossible to read, complain to the publishers. They will help you understand
the printout and may change future versions if they receive enough complaints.
You might also get some help from your school's test coordinator or the
publisher's representative. You should expect a test salesperson or publisher
to be able to explain clearly what you are getting and how to use it. If you
can't get a sufficiently clear explanation, don't use the test. To summarize,
three points:

1. Know what you're getting before you get it.

2. Read the manual or printout thoroughly.

3. If necessary, get an explanation from the people who sold you the test in
the first place.

Now on to some suggestions for developing your own tests.

IV. DEVELOPING CLASSROOM TESTS

If you can't find a standardized test that is appropriate for your needs, you
may want to develop your own test. There are a variety of excellent texts that
can provide suggestions on how to do this (see Thorndike & Hagen, 1977;
Gronlund, 1976; Bloom, Hastings, & Madaus, 1971; Wick, 1973).

In this section, I want to briefly outline one approach that I have found to
be particularly useful. If this method doesn't seem useful to you, you might
consult one of the works listed.

There is nothing magic about what follows: It can be found in some form in
most texts. I am presenting it here because if I have gotten you to stay with
me this far, you may profit from these ideas even though you may have run
across them before.

This is the rationale for what is presented here: You have to know what you
put into a test in order to understand what you get out of it. Teachers need to
be very careful about the design of a test in order to have confidence in the
results.

In order to present these ideas, it might be helpful to use an example. We
begin with a need for information about how our students are learning. (If we
don't need information, we don't need a test!) Let's say that we have been
teaching a social studies unit on different levels of government for three or
four weeks, and there is one more week allocated for instruction in this area.
It occurs to us that this last week of instruction would be most profitable if
we had a good idea of which students already comprehended what material. In
essence, we have answered the first question in developing a test: Is this test
necessary?

This question might be expanded to: What do I want to get out of this test?
The more precisely this question can be answered, the easier the rest becomes
and the more useful the test will be. This is worth focusing on a little more
closely. Earlier, we said we wanted a "good idea of which students already
comprehended what material." But what does this mean? We need a method for
specifying the information we want from our test. One way to do this is to use
a content-behavior matrix. A detailed discussion on developing such a matrix
can be found in Bloom, Hastings, and Madaus (1971). The essential idea of a
content-behavior matrix is to separate what we want students to be able to do
(behavior) from the material or subject matter we want them to do it with
(content). For instance, with respect to the social studies unit we might be
interested in the following behaviors:

1. Defining terms

2. Understanding the relationships among various elective offices

3. Applying the concept of checks and balances

These might be the content areas we are interested in:

1. City/municipal government

2. County government

3. State government

4. Federal government

Now, all these behaviors may not be related to all the content areas, but if we
construct a matrix, all the possibilities will be apparent, and we can examine
them to see which are important to us.
This has been done in Figure 2.



Figure 2
Content-Behavior Matrix

                                 City    County    State    Federal

Defining terms                   .05     .05       .05      .05

Understanding relationships      .10     --        .25      .25

Applying checks and balances     --      --        .10      .10


There are 12 cells in Figure 2, and an analysis of the combination of content
and behavior has suggested that nine of these cells are important for our test.
We decided that at the county level we were only interested in terms, and that
applying checks and balances at the city level was of little interest. However,
not all of these cells are equally important. It may be that we are primarily
interested in our students' understanding of the relationships among elected
offices at the state and federal levels. We might assign 25 percent of the test
to each of these categories (50 percent of the total). Then we might decide
that terms are worth 5 percent each (20 percent of the total). Of the remaining
three cells, we might decide to allocate 10 percent to each. This accounts for
100 percent of our test.

These decisions are somewhat arbitrary. What this activity of assigning
weights to various aspects of a test requires is that a teacher specify and
quantify what is important to his/her instruction. It is clear that this matrix
approach does result in a fairly precise statement of what will be obtained
from the test.

At this point, we are ready for the second question: What kinds of items
should I use? My answer here is simple: multiple-choice and short-answer.

Occasionally, I hear a good argument for essay questions. Frequently, I hear
a bad argument for essay questions. The bad argument has to do with getting
students to organize thoughts and communicate them clearly. I am all for
teaching students to do this. I am so much in favor of teaching this skill that
I don't think it should be used as a means of measuring something else. In this
example, organizing and communicating thought was not listed as something
we were interested in. If we are interested in this skill, we should state that
explicitly, and more important, we should teach students how to do it. Then, we
can test it.

The good argument for using essay questions is the same as the bad argument.
The only difference is that in the good argument, we state explicitly that
we are interested in the skills, and we address them in instruction.

Two other types of questions are possible: matching, which I don't care for
much; and true-false, for which I won't even listen to arguments. (Half of the
people who don't know the answer to a true-false item get it right anyway.)
Matching items are good for tying capitals to states, exports to countries, and
inventors to inventions. These may be worthwhile, but I can't seem to get
excited about them.

This leaves us with multiple-choice and short-answer items. Short-answer is
a good item type because it all but eliminates guessing as a factor.
Unfortunately, it is often hard to measure a good range of abilities with
short-answer items. Also, scoring can occasionally be ambiguous.

Multiple-choice items have many advantages: Scoring is quite simple and
completely objective (in that two people will score the test the same way); a
broad range of abilities can be tapped; and the item format is widely used and
generally understood by most students. The format has, however, two substantial
drawbacks. The first is that it is possible to guess the correct answer. One can
never be certain that a correct response indicates competence on the question.
The second problem is that we cannot tap production of correct responses, only
recognition of them. Therefore, a mixture of short-answer and multiple-choice
items is usually a useful format.
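
To put rough numbers on the guessing problem raised for true-false and
multiple-choice items, here is a minimal sketch. It assumes pure blind
guessing (no partial knowledge) and, for multiple-choice, four options per
item; both are my simplifying assumptions, not figures from this chapter.

```python
from math import comb


def expected_correct_by_guessing(num_items, num_choices):
    """Average number of items a blind guesser gets right."""
    return num_items / num_choices


def chance_of_at_least(num_items, min_correct, num_choices):
    """Probability that blind guessing yields min_correct or more right answers."""
    p = 1 / num_choices
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(min_correct, num_items + 1)
    )


# On a 30-item test, a blind guesser averages half the true-false items
# but only a quarter of the four-option multiple-choice items.
true_false = expected_correct_by_guessing(30, 2)
four_choice = expected_correct_by_guessing(30, 4)
print("Expected correct by guessing:", true_false, "vs", four_choice)

# Chance of getting 7 or more of 10 items right without knowing anything.
print("True-false:", round(chance_of_at_least(10, 7, 2), 3))
print("Four-option:", round(chance_of_at_least(10, 7, 4), 3))
```

Real students rarely guess blindly, so treat these as upper bounds on how
misleading a correct answer can be, not as a scoring correction.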

Having decided upon item format, the next question is: How many items?
The answer here is usually dictated by practical terms. How much time is
available? About one minute per item is usually a good amount of time to
allot. Let's say that we decide to use 30 items for our example. In order to
decide how many items to write for each cell, we simply need to multiply the
proportional weight of each cell (Figure 2) by 30, the number of items we need.
Often this will involve some rounding. Don't worry if you end up with 38 or 33
items instead of 30; what is important is that the final distribution of items
among cells is the way you wanted it to be. For example, in our government
test we would have three items (30 x .10) on "Understanding relationships/
City" and seven or eight (30 x .25) on "Understanding relationships/State" (see
Figure 2).

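
The multiplication and rounding just described can be sketched in a few
lines. Only the two "Understanding relationships" weights for City (.10) and
State (.25) come from the text; every other cell name and weight below is a
made-up stand-in, not the actual Figure 2.

```python
import math

# Hypothetical table of cell weights (they sum to 1.00). Only the City and
# State "Understanding relationships" values are taken from the text.
weights = {
    ("Knowing terms", "City"): 0.10,
    ("Knowing terms", "County"): 0.10,
    ("Knowing terms", "State"): 0.25,
    ("Understanding relationships", "City"): 0.10,
    ("Understanding relationships", "County"): 0.20,
    ("Understanding relationships", "State"): 0.25,
}


def allocate_items(cell_weights, total_items):
    """Multiply each cell's weight by the test length, rounding halves up."""
    return {
        cell: math.floor(weight * total_items + 0.5)
        for cell, weight in cell_weights.items()
    }


allocation = allocate_items(weights, 30)
for (skill, level), count in allocation.items():
    print(f"{skill} / {level}: {count} items")

# Rounding can push the total away from 30, which is fine as long as the
# distribution across cells still looks the way you intended.
print("Total items:", sum(allocation.values()))
```

Here 30 x .25 = 7.5 is rounded up to eight; rounding down to seven would be
just as defensible.
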
We have reached the next-to-last question: How do I write items? The
answer to this question is worth a book. I recommend Gronlund (1973) or
whatever good introductory measurement text is available to you. The issue of
item writing is too important to receive a cursory examination. Let me make
one suggestion: strive for clarity. If a student knows what you want him/her to
know on a particular item, your goal should be to make it impossible for that
student to get the item wrong. Conversely, you should try to make it impossible
for a student who doesn't know the content of an item to get the right
answer. To me, the first goal is paramount. For more information, invest some
time in reading about item writing; it will be well worth it.

Once the items are written, putting the test together and administering it is a
fairly straightforward activity. One suggestion here: Have students make two
copies of their answers. They can turn one copy in after completing the test,
and when all students have finished, you can go over the test with them
immediately while their responses are still fresh in their minds.

The final question is: What do I do with the results? To answer this, we need
to return to the need for the test. Recall that we had one week of instruction
left and were looking for the most profitable way to spend it. To begin our
analysis of the test results, we might ask: What does the entire class seem to
need help with? Are there any questions, or cells, that most students did not
perform well on? A good way to investigate this is to organize all of the
questions by cell and then list them across the top of a sheet of paper. Next,
arrange the students' scores from the highest to the lowest. List their names
down the side of the paper. This will create a matrix as in Figure 3 on page 26.
Now mark an "X" in each cell where a student missed an item. This is simple to
do and allows for a quick inspection of performance on a cell-by-cell or
item-by-item basis.
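
If you would rather let a computer draw the sheet of paper, the same matrix
can be sketched as follows. The student names, cell labels, and missed items
are invented for illustration; they are not the data behind Figure 3.

```python
# Hypothetical data: items grouped by cell, and the items each student missed.
items_by_cell = {
    "County terms": [3, 4],
    "City relationships": [9, 10, 11],
}
missed = {
    "Ann": {9},
    "Ben": {9, 10, 11},
    "Cara": {3, 4, 10},
}


def build_matrix(items_by_cell, missed):
    """Return the item columns (grouped by cell) and rows sorted by score."""
    columns = [item for items in items_by_cell.values() for item in items]
    scores = {
        name: len(columns) - len(wrong & set(columns))
        for name, wrong in missed.items()
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(missed, key=lambda name: (-scores[name], name))
    rows = [
        (name, scores[name], ["X" if item in missed[name] else "." for item in columns])
        for name in ranked
    ]
    return columns, rows


def print_matrix(columns, rows):
    """Print names down the side, items across the top, X for each miss."""
    print(f"{'Student':<10}{'Score':>6}  " + " ".join(f"{c:>3}" for c in columns))
    for name, score, marks in rows:
        print(f"{name:<10}{score:>6}  " + " ".join(f"{m:>3}" for m in marks))


columns, rows = build_matrix(items_by_cell, missed)
print_matrix(columns, rows)

# Column totals give the quick item-by-item inspection.
misses_per_item = [sum(marks[j] == "X" for _, _, marks in rows) for j in range(len(columns))]
print("Misses per item:", dict(zip(columns, misses_per_item)))
```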

Having determined the strengths and weaknesses of the class, you can now
do a student-by-student analysis. Perhaps students can be put into groups
according to common difficulties (for example, six students may have had
trouble with federal checks and balances; they might work together). In general,
this kind of analysis allows for statements about:

1. Needs of the class as a whole.

2. Needs of groups of students.

3. Needs of individual students.

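
Grouping students by common difficulties can be sketched the same way. The
data are hypothetical, and the rule "two or more misses in a cell means the
student needs help there" is my own arbitrary cutoff, not a standard.

```python
# Hypothetical data: items grouped by cell, and the items each student missed.
items_by_cell = {
    "County terms": [3, 4],
    "City relationships": [9, 10, 11],
}
missed = {
    "Ann": {9},
    "Ben": {9, 10, 11},
    "Cara": {3, 4, 10},
    "Dev": {3, 4, 9, 10},
}


def cell_misses(items_by_cell, missed):
    """For each student, count the items missed in each cell."""
    return {
        student: {cell: len(wrong & set(items)) for cell, items in items_by_cell.items()}
        for student, wrong in missed.items()
    }


def group_by_difficulty(items_by_cell, missed, min_missed=2):
    """List, per cell, the students who missed at least min_missed items."""
    groups = {}
    for cell, items in items_by_cell.items():
        group = sorted(
            student for student, wrong in missed.items()
            if len(wrong & set(items)) >= min_missed
        )
        if group:
            groups[cell] = group
    return groups


per_student = cell_misses(items_by_cell, missed)   # needs of individual students
groups = group_by_difficulty(items_by_cell, missed)  # needs of groups of students

for cell, names in groups.items():
    print(f"{cell}: {', '.join(names)} might work together")
```
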
Figure 3. Analysis of Test Results

By examining the pattern of incorrect responses on this partial class
diagram, we can draw the following conclusions about instruction:

1. Most of the class could use some review of understanding the relationships
among elected city offices (items 9-11).

2. A group of students need help with county terms (items 3, 4).

3. Item 13 may be poorly written. Students had much more trouble with it
than 12 or 14, which measure the same cell.

4. Kevin has the terms down but has trouble with relationships among
elected officials.

5. There are obviously other conclusions that can be made from these data.
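
The third conclusion, spotting an item that behaves differently from its
neighbors in the same cell, lends itself to a simple check. The miss counts
below are invented, and the 0.4 gap used to flag an item is an arbitrary
threshold of mine; a flagged item deserves a second look, nothing more.

```python
# Hypothetical counts of how many of 8 students missed each item.
items_by_cell = {
    "City relationships": [9, 10, 11],
    "State relationships": [12, 13, 14],
}
miss_counts = {9: 6, 10: 5, 11: 6, 12: 1, 13: 6, 14: 2}
num_students = 8


def flag_suspect_items(items_by_cell, miss_counts, num_students, gap=0.4):
    """Flag items missed far more often than the other items in their cell."""
    flagged = []
    for items in items_by_cell.values():
        if len(items) < 2:
            continue  # nothing to compare against
        for item in items:
            rate = miss_counts[item] / num_students
            others = [miss_counts[i] / num_students for i in items if i != item]
            if rate - sum(others) / len(others) >= gap:
                flagged.append(item)
    return flagged


suspects = flag_suspect_items(items_by_cell, miss_counts, num_students)
print("Items worth rereading:", suspects)
```

Note that items 9-11 are all missed often, so none of them stands out from
its cell; that pattern points to instruction, not to a badly written item.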



We began with a need to know how to proceed with instruction. We concluded
with statements that will help us do just that. If this seems neater and
tighter than most measurement activities, it is not just happenstance.
Usefulness should be a goal, not a fortuitous outcome.*

*Some needs do not lead to such nice turnarounds, but they are still important; for example, the
need for program evaluation, districtwide assessment of student progress, and so on. Here the
payoff is not quite as rapid and direct as with classroom instruction. The data provide answers to
questions such as "Should we change the Title I curriculum?" "Has Centerville solved its
bilingual education problem?" We should recognize that these, too, are decisions that require data
even though test results may not lead to such clear and immediate benefit to the test takers.

Note that grading students as a result of the test scores was not mentioned.
It is a different issue. If the test scores are used for grades, we introduce a host
of new elements into what was previously an uncomplicated procedure. Now
we have competition, anxiety, stress, and resentment. This is great for training
advertising executives, but it's a poor way to teach social studies.

A suggestion before leaving this area: Some of the educators I encounter feel
they know their students so well that they could fill in the X's in Figure 3
without giving the test. Even if you aren't that confident, try doing just that
some time. If you are about to give a test, make a chart like the one in Figure 3
and guess what the results will be (perhaps by using O's instead of X's). When the
results come in, put in the X's and check your accuracy. The differences
between the X's and the O's are an indication of how much more useful the
formal procedure was than informal speculation.
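
Checking your O's against the X's can also be sketched in code. The names,
items, and guesses below are hypothetical, and "agreement" here simply means
the share of student-by-item cells where the guess matched the result.

```python
def compare_guess_to_results(predicted, actual, students, items):
    """Compare predicted misses (O's) with actual misses (X's), cell by cell."""
    matches = 0
    unexpected_misses = []     # an X where no O was guessed
    unexpected_successes = []  # an O where no X turned up
    for student in students:
        for item in items:
            guessed = item in predicted.get(student, set())
            happened = item in actual.get(student, set())
            if guessed == happened:
                matches += 1
            elif happened:
                unexpected_misses.append((student, item))
            else:
                unexpected_successes.append((student, item))
    agreement = matches / (len(students) * len(items))
    return agreement, unexpected_misses, unexpected_successes


# Hypothetical chart: the teacher's O's and the test's X's.
students = ["Ann", "Ben"]
items = [1, 2, 3]
predicted = {"Ann": {1}, "Ben": {2, 3}}
actual = {"Ann": {1, 2}, "Ben": {3}}

agreement, surprises_bad, surprises_good = compare_guess_to_results(
    predicted, actual, students, items
)
print(f"Guess matched the result in {agreement:.0%} of cells")
print("Missed but not predicted:", surprises_bad)
print("Predicted but not missed:", surprises_good)
```

The two lists of surprises are the interesting part: they are exactly what
the test told you that you did not already know.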

Finally, writing your own test takes time; time that could be spent on other
activities. "Is writing a test worth the time?" is always a consideration. My
suggestion is to try the procedures mentioned in this chapter once or twice, and
then you'll know what the answer should be in your case.

V. SOME FINAL THOUGHTS



In closing, I would like to reiterate what I believe to be the more salient ideas
presented in this discussion.

1. The purpose of measurement (and therefore testing) is to facilitate the
instructional process. The utility of any testing activity ought to be clear
and demonstrable.

2. Testing is a useful way to gather information to facilitate instruction,
especially with groups of students. That is, I am contending that you can
use testing effectively to assist instruction (as contrasted with point (1),
which contended that it ought to do this whenever it is used).

3. The negative aspect of testing, from the student's perspective, is largely a
function of the consequences of testing rather than the activity itself.

4. Educators ought to be assertive (even aggressive), knowledgeable
consumers of standardized tests. Read Buros, consult a measurement specialist,
and talk to the publisher's representative until you are certain that
you understand what you will be getting out of the test you buy.

5. There are a bundle of measurement terms, but they aren't too hard to 
understand. 

6. Constructing your own classroom tests is a straightforward procedure
that can be quite useful in instruction if you plan your construction well.
Know what you are getting out of a test by knowing what you put into it.

I began this paper by stating that I don't "believe" in tests. What I do
believe in is informed decision making by teachers. This always requires
information, sometimes best acquired by testing. If the consequences of testing are
not threatening, the testing activity loses much of its oppressive connotation.
Don't take this on faith. Try it.






REFERENCES 

Bloom, B. S., Hastings, J. T. & Madaus, G. F. Handbook on formative and summative
evaluation of student learning. New York: McGraw-Hill, 1971.

Buros, Oscar K. The eighth mental measurements yearbook. Highland Park, N.J.:
Gryphon Press, 1978.

Glaser, G. R. Instructional technology and the measurement of learning outcomes.
American Psychologist, 1963, 18, 519-521.

Gronlund, N. E. Measurement and evaluation in teaching (3rd ed.). New York:
Macmillan Publishing Co., Inc., 1976.

Lortie, D. Schoolteacher. Chicago: Univ. of Chicago Press, 1976.

Naron, N. K. The Chicago early project: first year report. Chicago: Chicago Board of
Education, 1977.

Rapaport, D., Gill, M. M., & Schafer, R. Diagnostic psychological testing, rev. ed.
Holt, R. R. (Ed.). New York: International Universities Press, 1968.

Thorndike, R. L. & Hagen, E. Measurement and evaluation in psychology and
education. New York: John Wiley & Sons, 1977.

Wick, J. W. Educational measurement. Columbus, Ohio: Merrill Publishing Co.,
1973.






